Not All Character N-grams Are Created Equal: A Study in Authorship Attribution

نویسندگان

  • Upendra Sapkota
  • Steven Bethard
  • Manuel Montes-y-Gómez
  • Thamar Solorio
چکیده

Character n-grams have been identified as the most successful feature in both singledomain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate the predictiveness of each of these groups in two AA settings: a single domain setting and a cross-domain setting where multiple topics are present. We demonstrate that character ngrams that capture information about affixes and punctuation account for almost all of the power of character n-grams as features. Our study contributes new insights into the use of n-grams for future AA work and other classification tasks.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

On the Robustness of Authorship Attribution Based on Character N-gram Features

A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, there are doubts whether such a simple and lo...

متن کامل

Maximal Repeats Enhance Substring-based Authorship Attribution

This article tackles the Authorship Attribution task according to the language independence issue. We propose an alternative of variable length character n-grams features in supervised methods: maximal repeats in strings. When character ngrams are by essence redundant, maximal repeats are a condensed way to represent any substring of a corpus. Our experiments show that the redundant aspect of n...

متن کامل

Authorship Attribution in Portuguese Using Character N-grams

For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experimen...

متن کامل

Exploring Word Embeddings and Character N-Grams for Author Clustering

We presented our system for PAN 2016 Author Clustering task. Our software used simple character n-grams to represent the document collection. We then ran K-Means clustering optimized using the Silhouette Coefficient. Our system yields competitive results and required only a short runtime. Character n-grams can capture a wide range of information, making them effective for authorship attribution...

متن کامل

Local Histograms of Character N-grams for Authorship Attribution

This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015